R takes time to learn, like a spoken language. No one can expect to be an R expert after learning R for a few hours. This worksheet has been designed to introduce you to R, showing some basics, and also some powerful things R can do. The aim is to give you the confidence to continue. After this short introduction you can read this book to dive a bit deeper.

Most R programmers do not remember all the command lines we share in this document. R is a language that is continuously evolving. They use Google extensively to use many new tricks. Do not hesitate to do the same!


1. Intro to R and RStudio

RStudio is an interface that makes it easier to use R. There are four windows in RStudio. The screenshot below shows an analogy linking the different RStudio windows to cooking.

     

R script vs console

There are two ways to work in RStudio in the console or in a script.

Let’s start by running a command in the console.

Your turn 1.1: Run the command below in the console.

1 + 1

Once you’ve typed in the command into your console, just press enter. The output should be printed into the console.

Alternatively, we can use an R script. An R script allows us to have a record (recipe) of what we have done, whilst commands we type into the console are not saved. Keeping a record speeds up our analysis because we can re-use code, its also helpful to remember what we have done before!

Your turn 1.2: Create a script from the top menu in RStudio: File > New File > R Script, then type the command below in your new script.

2 + 2

To run a command in a script, we place the cursor on the line you want to run, and then either:

  • Click on the run button on top of the panel
  • Use Ctrl + Enter (Windows/Linux) or Cmd + Enter (MacOS).

You can also highlight multiple lines at once and run them at once - have a go!


✔️ Recommended Practice: Use Scripts
The console is great for quick tests or running single lines of code. However, for more complex analyses, using scripts is best practice. Scripts help keep your work organized and reproducible. We’ll use scripts for the rest of this workshop.


Commenting

Comments are notes to ourself or others about the commands in the script. They are useful also when you share code with others. Comments start with a # which tells R not to run them as commands.

# testing R
2 + 2

Writing detailed comments and documenting your work are useful reminders to your future self (and anyone else reading your scripts) on what your code does.

Commenting code is good practice, why not try commenting on the code you write in this session to get into the habit, it will also make your R script more informative when you come back to it in the future.


Working directory

Opening an RStudio session launches it from a specific location. This is the ‘working directory’.


ⓘ Understanding the Working Directory
The working directory is the folder where R reads and saves files on your computer. When working in R, you’ll often read data files and write outputs like analysis results or plots. Knowing where your working directory is set helps ensure R finds your files and saves outputs in the right place.


You can find out where the working directory is set to by using two different approaches:

  • Run the command getwd(). This will print out the path to your working directory in the console. It will be in this format: /path/to/working/directory (Mac) or C:\path\to\working\directory (Windows).
  • Click the cog icon/More > Go To Working Directory in the menu at the top of the bottom right panel. This will show you the location and files in your working directory in the files window.

Your turn 1.3: Where is you working directory set to at the moment? Is this a useful place to have it set?

By default the working directory is often your home directory. To keep data and scripts organised its good practice to set your working directory as a specific folder.

Your turn 1.4: Create a folder for this course somewhere on your computer. Name the folder for example, Week 1. Then, to set this folder as your working directory, you can do this in multiple ways, e.g.:

  • Click in the menu at the top on Session > Set Working Directory > Choose directory and choose your folder
  • In the bottom right panel, navigate to the folder that you want to be your working directory. Then click in the menu in the bottom right panel cog icon/More > Set As Working Directory

You will see that once you have set your working directory, the files inside your new folder will appear in the ‘Files’ window on RStudio.

Your turn 1.5: Save the script you created in the previous section as intro.R in this directory. You can do this by clicking on File > Save and the default location should be the current working directory (e.g. Week 1).


🔧 Multiple Ways to Achieve the Same Goal in R
You might have noticed by now that in R, there are often several ways to accomplish the same task. You might find one method more intuitive or easier to use than others — and that’s okay! Experiment, explore, and choose the approach that works best for you.


You might have noticed that when you set your working directory in the previous step, a line appeared in your console saying something like setwd("~/.../Week 1"). As well as the point-and-click methods described above, you can also set your working directory using the setwd() command in the console or in a script.

Your turn 1.6: What might be an advantage of using the command line option (setwd) instead of point-and-click to set your working directory?


Functions

In mathematics, a function defines a relation between inputs and output. In R (and other coding languages) it is the same. A function (also called a command) takes inputs called arguments inside parentheses, and output some results.

We have actually already used two functions in this workshop - getwd() and setwd(). getwd() does not take an input, but outputs your working directory. setwd() takes a path as its input, and sets it as your working directory.

Let’s take a look at some more functions below.

Your turn 1.7: Compare these two outputs. In the second line we use the function sum().

2+2
sum(2,2)

Your turn 1.8: Try using the below function with different inputs, what does it do?

sqrt(9)
sqrt(81)


Objects

It is useful to store data or result so that we can use them later on for other parts of the analysis. To do this, we turn the data into an object (also called a variable). We can use either the operator = or <- to do this. In both cases the object where we store the result is on the left-hand-side, and the result from the operation is on the right-hand-side.

For example, the below code assigns the number 5 to the object X using the = operator. You can print out what the X object is by just typing it into the console or running it in your script.

Your turn 1.9: Make an object called X and print it.

X = 5
X

As described above, you can assign objects using eith the = or <- operator.

Your turn 1.10: Compare the two outputs.

result1 = 2+2
result1

result2 <- 2+3
result2

Once you have assigned objects, you can perform manipulations on them using functions.

Your turn 1.11: Compare the two outputs.

sum(1,2)

X <- 1
Y <- 2
sum(X,Y)

Remember, if you use the same object name multiple times, R will overwrite the previous object you had created.

Your turn 1.12: What is the value of X after running this code?

X <- 5
X <- 10

Your turn 1.13: Can you write code to calculate the sum of the square root of 9 and 16?


✔️ Recommended Practice: Name your objects with care

Use clear and descriptive names for your objects, such as data_raw and data_normalised, instead of vague names like data1 and data2. This makes it easier to track different steps in your analysis.


So far we have looked at objects which are numbers. However objects can also be made of characters, these are called strings. To create a string you need to use speech marks "". Note that you can either use print() to print out your object, or just run the object name directly.

my_string <- "Hello!"
print(my_string)
## [1] "Hello!"

There are a whole host of different objects you can make in R, too many to cover in this session! Later on when we do some data wrangling we will work with objects which are dataframes (i.e. tables) and vectors (a series of numbers and/or strings). Let’s make a simple vector now to get familiar. To make a vector you need to use the command c().

my_vector <- c(1, 2, 3)
my_vector
## [1] 1 2 3
my_new_vector <- c("Hello", "World")
my_new_vector
## [1] "Hello" "World"

Your turn 1.14: Try making an object and setting it as 1:5, what does this object look like?


Once you have a vector, you can subset it. We will cover this further when we do some data wrangling but lets try a simple example here.

my_vector <- c("A", "B", "C")
# extract the first element from the vector
my_vector[1]
## [1] "A"
# extract the last element from the vector
my_vector[3]
## [1] "C"

When you have finished with an object, you can remove it from your environment using the rm() function.

rm(my_vector)

If you want to remove all objects from your environment, you can use the rm(list = ls()) command.

rm(list = ls())

Your turn 1.15: Create a vector from 1 to 10 and print the 9th element of the vector.


2. R Packages

We have seen that functions are really useful tools which can be used to manipulate data. Although some basic functions, like sum() and setwd() are avaliable by default when you install R, some more exciting functions are not. There are thousands of R functions avaliable for you to use, and functions are organised into groups called packages or libraries. An R package contains a collection of functions (usually that perform related tasks), as well as documentation to explain how to use the functions. Packages are made by R developers who wish to share their methods with others.

Once we have identified a package we want to use, we can install and load it so we can use it. Here we will use the tidyverse package which includes lots of useful functions for data managing, we will use the package later in this session.

Your turn 2.1: Install the tidyverse package.

install.packages("tidyverse")

We then load the package in our working directory:

Your turn 2.2: Load the tidyverse package so we can use it.

library(tidyverse)

I need help!

As described above, every R package includes documentation to explain how to use functions. For example, to find out what a function in R does, type a ? before the name and help information will appear in the Help panel on the right in RStudio.

Your turn 2.4: Find out what the sum() command does.

?sum

What is really important is to scroll down the examples to understand how the function can be used in practice. You can use this command line to run the examples:

Your turn 2.5: Run some examples of the sum() command.

example(sum)

Packages also come with more comprehensive documentation called vignettes. These are really helpful to get you started with the package and identify which functions you might want to use.

Your turn 2.6: Have a look at the tidyverse package vignette.

browseVignettes("tidyverse")


Common R errors

R error messages are common and often cryptic. You most likely will encounter at least one error message during this tutorial. Some common reasons for errors are:

  • Case sensitivity. In R, as in other programming languages, case sensitivity is important. ?install.packages is different to ?Install.packages.
  • Missing commas
  • Mismatched parentheses or brackets or unclosed parentheses, brackets or apostrophes
  • Not quoting file paths ("")
  • When a command line is unfinished, the “+” in the console will indicate it is awaiting further instructions. Press ESC to cancel the command.

To see examples of some R error messages with explanations see here


ⓘ More information for when you get stuck

As well as using package vignettes and documentation, Google and Stack Overflow are also useful resources for getting help.


3. Let’s get started with data!

Diamonds dataset

In this worksheet, we will explore the diamonds dataset, so let’s take a minute to familiarize ourselves with it. This dataset contains information about 53,940 round-cut diamonds. The dataset has 10 variables: price (in US dollars), carat (weight of the diamond), cut (quality of the cut), color (diamond colour), clarity (a measurement of how clear the diamond is), depth (total depth percentage), table (width of top of diamond relative to widest point), x (length in mm), y (width in mm), and z (depth in mm).


## Load the package

We use library() to load in the packages that we need. As described in the cooking analogy in the first screenshot, install.packages() is like buying a saucepan, library() is taking it out of the cupboard to use it.

Your turn 3.1: Load your tidyverse library. If you get an error message, it means that you have not installed it! (see the code in the Section 2).

library(tidyverse)


Load the data

The files we will use are in a format called csv (comma-separated values), so we will use the read_csv() function from the tidyverse. There is also a read_tsv() function for tab-separated values.

Your turn 3.2: Download the data file here. Unzip the file and store the content in the data folder in your working directory.

Your turn 3.3: Load the data into your R working directory. We will store the contents of the diamonds file in an object called diamonds_data.

# read in counts file
diamonds_data <- read_csv("data/Diamonds.csv")
## Rows: 53940 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): cut, color, clarity
## dbl (7): carat, depth, table, price, x, y, z
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

No need to be overwhelmed by the outputs! It tells us what data types read_csv is detecting in each column. Columns with text characters have been detected (col_character) and also columns with numbers (col_double). We won’t get into the details of R data types in this worksheet but they are important to know when you get more proficient in R. You can read more about them in the R for Data Science book.

4. Looking at the data

When assigning a value to an object, R does not print the value. We do not see what is in diamonds_data. But there are ways we can look at the data.

Your turn 4.1: Option 1: Click on the diamonds_data object in your global environment panel on the right-hand-side of RStudio. It will open a new Tab.

Your turn 4.2: Option 2: type the name of the object and this will print the first few lines and some information, such as number of rows. Note that this is similar to how we looked at the value of objects we assigned in section 1.

diamonds_data

We can also take a look the first few lines with head(). This shows us the first 6 lines.

Your turn 4.3: Use head() to look at the first few lines of diamonds_data.

head(diamonds_data)

We can look at the last few lines with tail(). This shows us the last 6 lines. This can be useful to check the bottom of the file, that it looks ok.

Your turn 4.4: Use tail() to look at the last few lines of diamonds_data.

tail(diamonds_data)

Your turn 4.5: What are the colors of the first 6 samples.


Dimension of the data

You can print the number of rows and columns using the function dim().

For example,

dim(diamonds_data)
## [1] 53940    10

diamonds_data has 53,940 rows and 10 columns.

✔️ Tip: Always Verify Your Data Size

Double-check that your data has the expected number of rows and columns. It’s easy to read the wrong file or encounter corrupted downloads. Catching these issues early will save you a lot of trouble later!

Column and row names of the data

Your turn 4.6: Check the column names used in in diamonds_data. Comment on the results you get.

colnames(diamonds_data)


The $ symbol

We can access individual columns by name using the $ symbol.

Your turn 4.7: Extract the ‘price’ column of the diamonds_data

diamonds_data$price


Describing Numeric Variables

To summarize a numeric variable, such as price in the Diamonds_1 dataset, we can use the summary() function to obtain key statistics like minimum, maximum, mean, and quartiles:

summary(diamonds_data$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823

To visualize the distribution of numeric variables, histograms and boxplots are commonly used:

hist(diamonds_data$price)

boxplot(diamonds_data$price)


Describing Character Variables

For categorical variables such as color, we can use the table() function to count occurrences:

table(diamonds_data$color)
## 
##     D     E     F     G     H     I     J 
##  6775  9797  9542 11292  8304  5422  2808

To visualize categorical data, bar plots are useful:

barplot(table(diamonds_data$color))

Filtering Data

To filter a dataset based on a condition, such as selecting only diamonds with color “E”, we can use the filter() function from dplyr:

filter(diamonds_data, color == "E")

Your turn 4.8: filter the data to only include diamonds with a price greater than 5000.


Adding New Variables

To create new variables, such as converting depth from percentage to centimeters, we can use base R or mutate() from dplyr:

diamonds_data$depth_cm <- diamonds_data$depth / 100

Using mutate() from dplyr:

diamonds_data <- mutate(diamonds_data, price_in_hundreds = diamonds_data$price / 100)

5. Pipe operator

The pipe operator %>% is a powerful tool in R that allows you to chain together multiple operations. It is part of the tidyverse package.

Before we go into the more advanced usages of the operator, it’s good to first take a look at the most basic examples that use the operator.

  • f(x) can be rewritten as x %>% f()

In short, the pipe operator is used to pass the output of one function to the input of another function. This makes your code more readable and easier to understand.

log(2)
## [1] 0.6931472
2 %>% log()
## [1] 0.6931472

Your turn 5.1: Write the following codes using the pipe operator.

sin(cos(0.5))
print(rev(unique(sort(c(5, 3, 9, 1, 4, 7, 1, 4)))))